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Abstract 

We present a sparse knowledge gradient (SpKG) algorithm for adaptively selecting 
the targeted regions within a large RNA molecule to identify which regions are most 
amenable to interactions with other molecules. Experimentally, such regions can be 
inferred from fluorescence measurements obtained by binding a complementary probe 
with fluorescence markers to the targeted regions. We use a biophysical model which 
shows that the fluorescence ratio under the log scale has a sparse linear relationship 
with the coefficients describing the accessibility of each nucleotide, since not all sites are 
accessible (due to the folding of the molecule). The SpKG algorithm uniquely combines 
the Bayesian ranking and selection problem with the frequentist i\ regularized regression 
approach Lasso. We use this algorithm to identify the sparsity pattern of the linear model 
as well as sequentially decide the best regions to test before experimental budget is 
exhausted. Besides, we also develop two other new algorithms: batch SpKG algorithm, 
which generates more suggestions sequentially to run parallel experiments; and batch 
SpKG with a procedure which we call length mutagenesis. It dynamically adds in new 
alternatives, in the form of types of probes, are created by inserting, deleting or mutating 
nucleotides within existing probes. In simulation, we demonstrate these algorithms on 
the Group I intron (a mid-size RNA molecule), showing that they efficiently learn the 
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correct sparsity pattern, identify the most accessible region, and outperform several other 
policies. 

Keywords: sequential decision making; knowledge gradient; bayesian statistics; sparse 
additive models; RNA molecules 


1 Introduction 


In recent years RNA has been rediscovered as a potent 


implications to bio tech nology and human health 


2010, DeVos and Miller, 


2013 


Chan et al 


drug 


Vazquez-Anderson and Contreras, 


target with important 


2006 


2013 


Benngtt and Swayze, 


Ihlaning 


et al. 


2015], 


Learning the structure of RNA molecules has become important in health research to improve 
the understanding of the interactions between RNA molecules and drugs. In addition, RNA 
regulates essential cellular processes through specific interactions with other biomolecules 
(e.g. proteins, other RNA or DNA molecules, etc). Disruption of an otherwise natural 
molecular interaction can potentially cause diseases. RNA can fold into intricate tridimen¬ 
sional structures making some regions accessible to interact with other molecules, while other 


i biochem ical te chno logy h as t aken huge leaps by mak- 


Gelderman and Contreras, 2013], significant progress is 


regions remain inaccessible. Althoug’ 
ing RNA sequences readily available 
still required to understand RNA structure. The scientists on the team have made it possible 
to determine accessible regions using a fluorescence-based system [Sowa et all 120151 ], where 


an RNA strand, hereafter referred to as a probe, interacts with a specific complementary 
region within a target RNA generating fluorescence (see Figure [1]). This fluorescence directly 
correlates to the accessibility of a given region within a target_RNA mol ecule. Although the 
in vivo RNA Structural Sensing System (iRS 3 ), as named by S owa et al. [20151 ]. is a valuable 


tool as it provides the accessibility of a segment of interest within an RNA molecule in living 
cells, synthesizing and running the experiments in the absence of any apriori information 
of the RNA can be expensive and time-consuming. With the purpose of expanding its use 
to characterize a full molecule, we undertake the endeavor of optimizing the experimen¬ 
tal settings of the iRS 3 . This paper seeks to use the knowledge gradient policy, adapted 
to a high-dimensional, sparse linear model to maximize the information gained from each 
experiment. 

Our work addresses the problem of sequentially guiding experiments to identify the ac¬ 
cessibility patterns of an RNA molecule known as the “Tetrahymena Group I intron” (gl 


has been extensively characterized 


Kieft and Tinoco 

1997 

2004. 

Wan et al.. 

2010, 


Cech et al.. 

1981, 

Krueer et al.. 

1982. 

Cech et al., 

1994, 


Golden et al 


1998, 


Vazquez-Anderson and 


Russell et al. 
ContreraX 


2002, Koduvayur and Woodson, 


201 3]. Determining these accessi¬ 


bility patterns is difficult to do in silico, as the y dep enc 
molecule known as the tertiary structure 
details on in silico approaches, see 


Scherr et al 


on the comp lica ted folding of the 


2000 


Miickstein ct al 

Vazquez-Anderson and Contreras 


20061 ]. For 


2013 ], Experimentally, 


such accessibility patterns can be inferred from fluorescence measurements obtained from 
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Figure 1: Illustration of the in vivo RNA Structural Sensing System (iRS 3 ) 


Notes. A complementary probe with fluorescence marker is synthesized to bind to the 
targeted region of an RNA molecule. If the probe targets an accessible region (left), an 
interaction will occur leading to a strong fluorescence signal. Otherwise, if no major 
fluorescence signal is detected in the presence of the target RNA, it is interpreted as no 
interaction between the target RNA molecule and the probe (right). 


the iRS3 by using various com plementary probes designed a priori to target a region within 


the intron Sowa et ah, 120151] . However, the number of variations of the probe increases 
quadratically in the number of nucleotides; therefore, the number of candidate probes is 
usually extremely large. A critical problem is therefore deciding which targeted regions 
should be tested, especially given the time and cost to perform each experiment. 

The problem of identifying the accessibility pattern of an RNA molecule can be modeled 
mathematically as a ranking and selection (R&S) problem. By limiting the length of the 
probes, we are confronted with a collection of targeted regions of the RNA, which we call 
alternatives in R&S. In this R&S problem, we have a budget of measurements that we need 
to allocate sequentially to test the alternatives. As more information is collected, the belief 
distribution is changed or updated by conditioning on all the observations we have up to this 


3 






















point. Our goal is to maximize our ability to gain valuable information and the reward until 
the budget is exhausted. Although there are many papers on R&S problems, only a few make 
use of the model structure of the underlying belief model. The classical model for R&S is the 
lookup table belief model, which does not assume or exploit any structure. In identifying the 
accessibility pattern of an RNA molecule, where we may have tens or hundreds of thousands 
of alternatives, the lookup table strategy is computationally intractable. It may also be 
inappropriate when our goal is also to learn about the underlying model structure. 

In this paper, we use a thermo-kinetic model which represents the log fluorescence level 
as a linear model of the weight coefficients representing the accessibility of each nucleotide. 
This coefficient vector is of the same dimension as that of the target molecule and thus can 
be high-dimensional. However, it is typically the case that only a small portion of these 
coefficients contain explanatory power, because not all sites are accessible due to the folding 
of the molecule. In such cases, a sparse linear model can offer considerably more flexibility 
than a linear model without sparsity structure. Therefore, we develop a spa rse knowledge 
gradient (SpKG) policy for sequential experimental design 


Li et al 


20151 ]. This policy 


combines Bayesian R&S with a frequentist l earning approach for recursive Lasso (Least 
Absolute Shrinkage and Selection Operator) 


Tibshirani. 


1996, 


Garrigues and El Ghaoui, 


2008.1 Ch en and Herd . 2012]. which is a well known l\ regularized version of least squares. 


In this paper, we perform thorough testing of the SpKG algorithm using the setting 
of RNA accessibility identification. However, the SpKG algorithm is broadly applicable 
to learning problems with high-dimensional belief models, a problem domain that has been 
attracting considerable attention. Furthermore, SpKG can also be applied to problems with a 
known sparse group structure, as might happen when variables exhibit a natural clustering. 
For example, applications in health might exhibit clusters of variables related to specific 
medical conditions. In add ition, it can also handle interaction terms with smoothing spline 
ANOVA models 


see 


Li et al. 


2015 ], 


It is worth noting that the SpKG algorithm is a unique and novel hybrid of Bayesian R&S 
with the frequentist l\ re gularized regres s ion known as Lasso. Both have been extensively 
studied in different fields Friedman et al.. 2001 


Powell and Rvzhov, 2012]. For example, as 


a regularized version of least squares, Lasso minimizes the residual sum of squares subject 
to the sum of the absolute value of the coefficients being less than a constant. As a result 
of the penalty term, many of th e co efficients will be exactly zero for large problems. Since 
Lasso was first proposed by Tibshirani 19963 ]. there has been a considerable amount of work 


exploring the use of Lasso in different settings. It is useful in many settings due to its 
tendency to prefer solutions with fewer nonzero parameters, effectively reducing the problem 
dimension. For this reason, Lasso and its variants, such as elastic net regularization, are 
fundamental to many high-dimensional regression models. 

Most of the previous work is in the classical batch setting, where we are given a dataset 
from the beginning, with no control over how the observations are chosen. However, our 
application requires guiding experiments in an online fashion. It leads to the acquisition 
of new information about the environment which may improve future decisions. For this 
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with an T'i.oo group Lasso penalty. In this work, they consider a more general group sparsity 
system, which is composed of a few known nonoverlapping clusters of nonzero coefficients. 
The coefficients among each group have some correlation and are either all selected or not. 
Our work takes Lasso estimator updated from the homotopy algorithm as a sample from 
the true distribution of the coefficients. Then we use this to update both the conditional 
normal distribution of the coefficients and the Beta-Bernoulli conjugate distribution of the 
probability distribution of whether each coefficient is selected or not. 


rections: offline learning, widely known as ranking and selection (R&S) 
and online learning, often referred to as the multiarmed bandit problem 
Either problem can be approached using frequen tist o r Bayesian approaches. For example, 


Swisher et al. 

2003 

Gittins et al. 

2011 


optimal computing budget allocation, or OCBA Chen, 2010, Chen et al.. 1201 2]. is a frequen¬ 
tist approach developed within the simulation optimization community for finding optimal 
designs that are tested using Monte Carlo simulation. Upper confidence bounding (UCB) 
policies, widely studied in the machine learning community for online bandit problems, iare 
often approached as distribution-free strategie s that enjoys bou n ds on the number of times 
that the wrong alternative might be tested 


Auer et al 


2002, 


Bubeck and Cesa-Bianchi, 


2012], Frequentist approaches tend to require that each alternative be tested at least once. 

A substantial literature has gro wn up around the general strategy of maximizing the value 
of information. Gupta and Miescke 199ft] first proposed this idea for the offline ranking and 


selecti on problem, an idea that has been pursued under th e name of the knowledge gradient 
(KG) Frazier et all 120081 . 120091 . IPowell and Rvzhovl . 12012 ], Approximations of this idea have 


been proposed under names including sequenti al kriging optimization (SKO ) 


200611 . and efficient global optimization (EGO) 


.Tones et al. 


1 99h! . IBuIII . 1201 1 ]. 


Huang et al 


Ryzhov et al. 


20121 ] shows that the knowledge gradient can be easily applied to both online and offline 


problems. Furthermore, a batch KG policy based on Monte Carlo is proposed in the decision 
problems where the experimentalists may be able to run several parallel experiments in 
batches by maximizing the value of information for an entire batch Wang et al., 2013 ]. 


Value of information policies are particularly well suited to problems where experiments 
are time consuming and expensive. The idea is particularly powerful when we can exploit 
belief models that capture some of the underlying structure of the problem (for example, 
linear belief models). This is the setting we face in this paper. 

Identifying and validating RNA structures has been a problem of interest for the molecu¬ 
lar biolo gy c ommunit y, especially si nce the catalytic properties of RNA were discovered back 
in 1982 


Cech et al. 


1981 


Gold et al 


19951 ]. Unveiling the structure of an RNA molecule 


is critical to the understanding and exploitation of the interactions established with other 
RNA molecules. In this context structural accessibility becomes a central object of study. To 
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this purpose the same experimental methods mentioned above have been used to understand 
this phenomenon, and a number of compu tati onal algorithms have been developed to iden¬ 
tify and characterize RNA interactions [Pain et al.. 120151 ]. A semi-empirical thermodynamic 


model common to a lot of the computational algor ithms available to study RNA structure 
is the nearest neighbor method SantaLucia. 11998 ]. A rece nt example of an algorithm to 


explain RNA-RNA interactions is the one proposed by Rodrigo et ah 20131 ]. This paper has 
proposed a thermo-kinetic model including both Gibbs free energy and a kinetic function 
considering an intermediate stage (known as the seeding interaction). Another important 
strategy to predict and understand RNA structure is the use of the partition function and 
other stochastic methods. In this paper, we use a novel semi-empirical (since it uses experi¬ 
mental DMS footprinting data) thermo-kinetic model based on the nearest neighbor model 
parameters Xia et ah, 199.8]. We design a sequential experimentation policy based on max¬ 


imizing the value of information (the knowledge gradient) using a Bayesian belief model. 
We showcase this strategy in the context of the setting of characterizing the structure of an 
RNA molecule using fluorescent probes. 

This paper makes the following contributions. (1) We provide a sparse additive belief 
model for the fluorescence level produced by a probe applied to an RNA molecule. Here 
the log fluorescence level is a linear combination of the weight coefficients describing the 
accessibility of each nucleotide. The coefficient vector is sparse because not all sites contribute 
to the thermodynamic binding process. (2) We derive the batch SpKG policy which generates 
several suggestions sequentially to run parallel experiments. (3) We then introduce a new 
length mutagenesis procedure where new alternatives, in the form of types of probes, are 
created by inserting, deleting, or mutating nucleotides within existing probes. Then by each 
experiment we enlarge the alternative library by adding in the one with the highest value 
of information from the larger library generated by the length mutagenesis procedure. (4) 
We demonstrate the effectiveness of the SpKG policy, with length mutagenesis and batch 
learning, in the setting of selecting probes to maximize fluorescence (as an indication of 
identifying accessible region) for RNA molecules with hundreds of potential sites. 

The remainder of the paper is organized as follows. Section [2] describes the ranking and 
selection problem and the biophysical model. In section 3, we introduce the KG policy with 
ja (nonsparse) linear belief model, and then describe the sparse linear model proposed by 


Li et al. 


201 5], Then we extend the SpKG algorithm to handle batch experimentation, as 


well as the new length mutagenesis procedure. Section 0] reports on the application of the 
procedure to the in vitro DMS footprinting data with the RNA molecule Group I intron. 
Section [5] concludes the paper. 


2 Model 

We begin by considering a Bayesian R&S model where we have M alternatives. Let X 
be a finite set consisting of the M alternatives and ji x : x £ X > M be a mapping from each 
alternative to its value. We have a budget of N measurements, and we wish to sequentially 
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decide which alternative to measure so that we can find the best alternative when our budget 
is exhausted. Let fi = \ji i,... ,^m] T (Table [1] provides a summary of the notation used in 
the paper). We assume that /x follows a multivariate normal distribution: 

/(0,£). (1) 

Table 1: Table of Notation 


Variable 

M 

X 


l^x 


N 

x i /x i 

y n+1 /y2 +1 




e n , s n 

n 

v 




OLk 

Cj 

(£7 > ) 

d n 


K G,n 


I< 

B 


Q 

( 0 k '\ s fc ' 6 ) 


Description 

Number of alternatives/testing probes 
Set of alternatives 
Unknown mean of alternative x 
Number of measurements budget 
Column vector (yui,..., /im) T 

Sampling decision at time i (vector or scalar index) 

Sampling observation from measuring alternative x n 

Measurement error of alternative x n 

Known standard deviation of alternative x 

Mean and covariance of prior distribution on /x at time n 

Set of all possible policies 

Number of features/nucleotides 

Basis function for learning the local energetic value 

Linear transformation matrix 

Weight accessibility coefficient of nucleotide at site k 
Random indicator variable of aj 

Mean and covariance of posterior distribution on a after n measurements 
Set of shape parameters of beta distribution on p" 

Lasso estimate at time n 

Mean and covariance matrix estimator from Lasso solution at time n 
State variable, defined as the pair (0 n , E") 

Knowledge gradient value for alternative x at time n 
Number of possible sample realizations of £ 71 
Parameter of Bernoulli distribution on ( 

Number of batch measurement budget 
Number of batch experiments at each time 
Number of Monte Carlo simulations 

Mean and covariance of posterior distribution on fi after k batches and additional b measurements 
Mean and covariance of posterior distribution on a after k batches and additional b measurements 


Consider a sequence of N sampling decisions, x°, x 1 ,..., x N 1 . At time n, the measure¬ 
ment decision x n selects an alternative from set X to sample, and we observe 

^ +1 = Mx + e ” +1 , 

where e” +1 ~ and a x is known. At the beginning, we may think of /t as a 

realization of the distribution given in (jT]) , while the experimenter is only given some prior 
fx ~ E°). Throughout the experiment, the experimenter is given the opportunity to 

better learn what value /x has taken through the sequential sampling decisions. 
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For convenience, let J 7 " be the c-algebra generated by the samples observed up to time 
n. Note that we have chosen our indexing so that random variables measurable with re¬ 
spect to the filtration T n are indexed by n in the superscript. Following this notation, let 
0 n = E(/i| J~ n ), and = Var(/x| J 711 ). This means the posterior distribution on p, is also 
multivariate normal with mean 6 n and covariance matrix S”. Let II be the set of all T n 
measurable policies. That is II := {(x°,... ,x w_1 ) : x n € J - n }. Our problem is to find the 
policy that solves 


sup max 6% 

wen 


2.1 The Biophysical Model 

In the RNA accessibility identification problem, let T be a molecule comprised of RNA 
nucleotides, called the target molecule. Denote the target molecule sequence as 

T = 

where ti £ {A, C,G,U} are the individual nucleotides. RNA molecules range from a few to 
thousands of nucleotides, but the typical length is several hundred. For the specific group I 
intron RNA molecule we work with in this study, it contains p = 414 nucleotides. Depending 
on this sequence, the target molecule will fold upon itself in a thermodynamically favorable 
manner. The precise, three dimensional structure of this molecule upon folding is called the 
molecule’s tertiary structure. Particularly, identifying regions of a molecule most amenable 
to interactions with other molecules is important to understanding how such interactions are 
mediated. Such regions depend on the molecule’s tertiary structure. Those regions that are 
well protected in a mechanistic sense are less likely to interact with other molecules than 
those regions that are exposed. We refer to the regions more likely to interact with external 
molecules as accessible regions. Identifying such regions is accomplished by sequentially and 
adaptively selecting the sites of the target molecule to bind a complementary RNA probe 
reporter. The RNA probe reporter includes a sequence, which is typically 8 to 16 nucleotides 
in length. 

There is no precise definition of absolute accessibility of a region. However, we can think 
of the accessibility of a region relative to the accessibility of another region. This distinction 
is important experimentally. Therefore, to determine the accessibility of a region, the probe 
reporter also includes a fluorescent marker. The presence of the fluorescence at the end of an 
experiment, which the experimenter can measure optically, indicates whether the probe has 
successfully bound to the target region. The intensity of this fluorescence is an indication to 
how well this binding has occurred. 

As described above, one can think of an “alternative” as a specific region within the 
target molecule or a complementary probe and its “value” as the amount of binding or 
the fluorescence level synonymously. We now describe briefly the biophysical model that 
connects both through a linear model. 



Our main assumption is that the fluorescence measurements are a combination of mech¬ 
anistic accessibility (kinetics) and change in Gibbs free energy between bound and unbound 
states (thermodynamics). To model this, we consider the accessibility profile 

CX ( Oi \ , . . . , CXp ) , 


of the target molecule, where a/, is a weight describing the relative accessibility of nucleotide 
k in the target molecule. This profile is generally unknown and can be estimated through 
some experimental data. Then the fluorescence intensity to target region [i, j] can be modeled 
as 


(-£>1 p ' 

Mbind (i,j) ■= log-j—y = MiJ) + ^2 a k(j>k(i,j), 


k =1 


(2) 


where Mbind(b.7) represents the amount of binding to the target region [U ] and [B] 

denote the fluorescence intensity of the bound and unbound states, respectively. 
is a base energy gradient value for attempting to bind to region and is the 

local energetic contribution of the fc-th nucleotide position. Given the target molecule and 
under some assumpti ons, the val ues </ >&,(£. 7 ) are known. It is the energy as measured for 


Xiaetah 


1998 ]. 


Watson-Crick Helices = _ a 

As mentioned above, the accessibility profile vector a = [a\,...,a p ] T is sparse, which 
means that many accessible values are zero or near zero. Mechanistically, this is a reasonable 
assumption, as we expect the tertiary structure of any sufficiently large molecule to be well- 
folded, meaning the proportion of exposed, mechanistically accessible regions to protected, 
inaccessible regions to scale like surface area to volume. Experimentally, the prior estimate 
from the experimental footprinting data also shows such property as one can see later in 
Section HE Therefore, this model as shown in @ is a sparse additive model. 


2.2 The Bayesian Sparse Additive Model 

The above model shows that the amount of binding /x is linear in the weight coefficients 
a representing the accessibility of each nucleotide. Here one can view a “feature” or an 
“attribute” as the accessibility of each nucleotide. Furthermore, we let represent 

the region for the m-th alternative and /x m represent /Xbind j^), then we can write m 
into the following affine system 


/w\ 


/ (/>i(z (1) ,z (1) ) j (1) ) ••• <Ap(* (1 \/ (1) ) \ 


/ 0]\ 


/ </>o(* (1) ,i (1) ) \ 

M 2 

= 

<M* (2) ,j (2) ) 4>2 (* (2) ,j (2) ) ••• <^p(* (2) ,/ (2) ) 


OL2 

+ 

</>o(* (2) ,/ (2) ) 

\MM/ 


U(* (M) ,j (M) ) ••• 


\ a v) 




Here if we write the basis matrix as € M Mxp and the intercept vector as $ 0 , the above 
linear equations can be written in the matrix form: /x = $o: + $ 0 . Since $0 is known, we 
can assume $0 = 0 without loss of generality. Thus we have 

fi = &a, ( 3 ) 
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where p, and a are random variables. We know that ot is sparse in the sense that most of 
its components are zero. In the Bayesian setting, whether each feature is zero or not is also 
random. Specifically, let £ = [Ci, • • •, Cp] £ be the indicator random variable of a, that is 


C? 


1 if aj ^ 0 

0 if aj = 0 ’ 


for j = 1,... ,p. 


Additionally, conditioning on we assume that a follows a multivariate normal distribution 
with mean $ and covariance S 15 , that is 


a | C 


Without loss of generality, conditioning on £, we can permute the elements of c* and partition 
a into the nonzero part and the zero part, so a 1 = [(cc^) 7 ’, 0] . Besides, conditioning on £, the 
components of $ and are only nonzero where indexed by S. Furthermore, conditioning on 
£, we get that p follows a multivariate normal distribution through the linear transformation, 
that is 


Combining this with (fT|). we have 


e = 

£ = $£^ r . 


The linear model in Q allows us to maintain the belief model in the parameter space 
rather than a look up table belief model in the alternative space. In the case that the 
parameter structure is sparse, we use a frequentist learning approach (Lasso) which uses 
a least squares regression with an i\ regularization penalty, to update the belief model. In 
order to do this recursively, we introduce Beta-Bernoulli conjugate priors on each component 
of £. Specifically, at time n, we have the following Bayesian model, for j, j 1 = 1,... ,p, 


a | C n = 1 ~.A/'(0 n ,£*’ n ), 

(4) 

(j n | p] ~ Bernoulli (p" ), 

(5) 

C'' kf for 

(6) 

p?|#,7#~Beta(g*,»tf), 

(7) 


where p r - is the probability of the j-th feature being in the model, and (£”, r/") are the 
shape parameters for the beta distribution of p r -‘. We assume that and (■', are inde¬ 
pendent for different groups j and j'. Then after we get the new measurement ( x n ,y n+1 ), 
w e re cursively u pdate the current Lasso estimate from r9 n to by the algorithm in 


Garrigues and El Ghaoui 2008 ] and sample a covariance matrix £ 1? ’ n+1 from the first order 


optimality condition. For notational simplicity, we let and £^’" +1 denote the nonzero 

parts, leaving the S superscripted by time n + 1 implicit. If we regard the Lasso estimate 
as a sample from the conditional multivariate normal distribution, we can use the following 
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heuristic updating scheme for a Beta-Bernoulli model and a normal-normal model. The 
updating equations are given by: 


or 1 




^5 

■^ 19 , 71+1 

q +1 = q +1, 

pn -\-1 pn 


#T 1 + (=. 


T?,n+1\—1 
S > 


-1 




< +1 = r,l 


for j € 5, 


_7l+l 


= Vj + 1, for j ^ 5. 


( 8 ) 

(9) 

( 10 ) 

( 11 ) 


To illustrate, one can think of the hyperparameters (£j, r/j ) as the frequencies of “in” and 
“out” for each attribute and are updated through the Lasso estimates. Therefore, £”/(£” + 
rfj ) can be viewed as approaching the probability of the j-th feature being nonzero as n 
becomes large. If the Lasso estimators can correctly recover the sparsity pattern asymp¬ 
totically, then our approach should also identify the accessible nucleotides as the sampling 
budget goes to infinity. Theoretically, we have shown that our posterior mean estimate 
converges to the truth asymptotically under some conditions 


Lietah 


20151] . 


3 The SpKG Algorithms 


Before introducing the SpKG algorithm, we first briefly review the knowledge gradient 
policy with both a lookup table belief model and a nonsparse, linear belief model. The 
knowledge gradient policy for correlated beliefs (KGCB), as introduced in Fr azie r et al. 20091 ] 


is a fully sequential policy for learning correlated alternatives. At each time n, it makes the 
decision to measure the alternative with the largest expected incremental value, which is 
defined as 


v KG,n = E ( r 




and 


x KG ’ n = argmaxr x KG ’ n , 

x£X X 


where the knowledge state S n is defined as S n := ( 6 n , S n ). The KG policy can be viewed 
as a gradient ascent algorithm. Maintaining a multivariate normal belief on the alternative 
space, we can update our belief by the Bayes rule and the Sherman-Morrison formula. Taking 
x n = x to simplify subscripts, the updating equations are 


6 n+1 

£ n+1 


6 n + 



n+l _ QTL 
Ux - 

a x + ^xx 

V n e x elV n 

rr‘2 i yn ’ 

U X ' ^xx 


Xl 


( 12 ) 

(13) 


where e x is the standard basis vector with one indexed by x and zeros elsewhere. We can 
further define a vector-valued function a as 


5(S n ,x) 


S n e x 

V a x + ^ XX 
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and define a random variable 


z n +i = 


iUx + 1 - 0 n x) 


v / Var[^+ 1 -^|^]’ 

which is a one-dimensional standard normal random variable when conditioned on T n 


Frazier et al 


2008 ]. Then we can write (fl2l) as 

0 n +i = Q n + a{'E n ,x n )Z n+1 . 


This allows us to compute the KG value as 


f G ' n = E(max0”, + a x ,('E n ,x n )Z n+1 \S n ,x n = x) - max^, 


x'ex 

= h( e n ,a{i: n ,x)). 


x'ex 


(14) 


Here h : W V1 x i —> R is defined by h(a,b) = E[max ? ; a,; + biZ] — rnaxj a*, where a and 

b are deterministic M-dimensional vectors, and Z is any one-dimensional standard normal 
random variable. 


Frazier et al. 


20091 ] provides a method to compute h(a,b ) with complexity 


0(M 2 log M). 

In the case of a linear model when /x = and we do not hold sparsity belief on a, 
there exists recursive least squares (RLS) updating equations for ($ n , Elthat are similar 
to the recursive updating in (fT2l) and (fl3l) . Before providing the updating equations, we first 
introduce some additional notation. Let 4> n = [^",..., 0p] 7 be the column vector describing 
the alternative that is measured at time n after the basis transformation cj). Then, the 
following updating equations result from stan dar d expr essions for normal sampling of linear 


combinations of features |see 


Powell and Ryzhov, 2012, p. 187]: 


tf n+1 = R n + 


£n+1 


i#,n+l _ 


= s 


i?,n 


^,n,n 








(15) 

(16) 


where e ,l+1 = y n+1 — (^”) T 0 n , and = a 2 + (4> n ) J Ti' d,n (j) n . For the KG computation, 
we can just plug the linear transformation 6 = and El = thX 19 *!? 7 into equation (11411 . 
The linear model exponentially reduces the computational and storage requirements of the 
lookup table model. The essential idea is to maintain a belief on the attributes. A summary 
of the KG policy with a nonsparse, linear belief model is outlined in Algorithm [lj 


3.1 The SpKG Algorithm 

It is worth noting that, for the sparse linear belief, we introduce the random variable Q 
to indicate if the j-th attribute is selected or not. As a result, the KG calculation in m 
needs to be modified so that the expectation is also taken over £. 

Specifically, at time n, the Bayesian model is as described in dH)-®. The prior is 
a discrete random variable. Let C”’ 1 ,..., C, n,L be all the possible realizations of and 

P(C n = C n ’0 = P n,l J = 1,... ,L. To compute the KG value, we need to approximate the 
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Algorithm 1 Knowledge gradient algorithm with nonsparse linear belief 
Require: i9°, I] 15 ’ 0 , <f>. 

1: for n = 0 to N — 1 do 
2: for x = 1 to M do 

3: a <r- <f># n 

4: ^x,*/ V^x + 

5: v <— h(a , b) 

6: if x = 1 or v > v* then 

7: U* ■<— V, X* ■<— X 

8: end if 

9: end for 

10: X n = X * 

11: Get a new measurement: (x n ,y n+1 ) 

12: # n+1 .(— i?” %RLS update by (fl5l) (flHli 

13: <_ E 19 ’” 

14: end for 

15: return '3 N ,T,' l9 ’ N . 


distribution of (C” >P n+1 ) by that of (C n >P n )- This is because the change of the sparsity 

belief depends on the next observation and the Lasso algorithm, and thus can be very 
complicated to model. Therefore, by the Law of Total Expectation, the KG value can be 
computed by weighting over all the possible sparsity structures 


Li et al. 


2015]: 


v KG,n _ jg x (max #T +1 \S n , x n = x) — max 6™, 

X ,l x'GX X -r-'cV x 


x’&X 


E p nE^| p nE a e |^n )P n(max0” ; ' fl |S ,n , x n = x,C n iP n ) — max#]] 
L 

J^E pn (p n ' l )h(a n ' l ,b n ’ 1 ) 


i=i 

L 


e n 


s 3 


n 


i=i 


f n + n n 
{]■ C”’=1} 3 3 {j-Cj’ l =o} 


—— h(a n ’ l ,b n ’ 1 ), 
$ + rf> 1 A 


(17) 


where 


a 


n,l 


b n ’ 1 




n,l ) 

E 71,19 

£n,l 




Here h is the function defined in (ED- The subscript means the submatrix is taken 

with all the rows and columns indexed by £ n ’ 1 . The same notation is used throughout the 
paper. Since L can be as large as 2 P , we can sort the weights and approximate the KG value 
by only computing the ones with the largest probabilities. In that case, we approximately 
compute the KG value to avoid the curse of dimensionality. This is reasonable because we 
expect the sparsity pattern to converge as n becomes large. We summarize the SpKG in 
Algorithm [21 
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Algorithm 2 Sparse knowledge gradient algorithm 

Require: i9°, 5^’°, {£?, rf- } k =] , <&, N , regularization tunable parameters {A*}^L 0 - 
1: for n = 0 to N — 1 do 

2: Compute KG by (fTDk x n = argmaxrx G,n %compute h as in Algorithm 1 

3: Lasso homotopy update: •d n , (x n , y n+1 ) € R m x R, A n , A n+1 —> $ n+1 

4: Approximately simulate S^’ n+1 

5 : tf n+1 <- v °’ n+1 <- S’ 9 ’”, {Cj + 1 ,v] + 1 } p j =1 <- {£j,Vj } p j= 1 by (ED-(HU) 

6: end for 

7: return , nf} P j = i- 


3.2 The Batch SpKG Algorithm 

The sequential knowledge gradient policy fails to account for the ability of experimental¬ 
ists to run several experiments in parallel or in batches. Batch experiment means running 
a group of experiments at the same time. For example, the experimenter could divide a 
plate into squares, where each square is a different experiment, but the plate is immersed 
in a chemical bath at the same time. Doing experiments in parallel has the connotation of 
literally running different experiments on different machines at the same time. For example, 
in this RNA problem, the experimenter can synthesize three probes in parallel to test their 
fluorescence intensities all in one run. Our batch SpKG algorithm can_deal with _both “batch” 
and “parallel” settings. To handle such experimental settings, Wang et ah 2015] proposes 
a Monte Carlo based batch knowledge gradient (BKG) approach to guide the batch exper¬ 
imental design by maximizing the value of information for an entire batch with the lookup 
table belief model. In this section, we first review the BKG policy with lookup table belief 
model and linear belief model, after which we then derive the new Batch SpKG policy. 

To begin, we modify our notation to fit the batch measurements. Suppose we are given a 
batch measurement budget of I\ with B batch decisions at each time. Then the total number 
of measurements allowed is N = BK. Now at time step k, we choose to measure a batch of B 
alternatives x k,0 : x ktl , ..., x k,B_1 simultaneously and get B observations y k+1, °, ...y k+1,B ~ l . 
We also use superscript ( k , h) to index our beliefs. For example, the prior multivariate normal 
belief can be rewritten as (0 0,0 , H 0,0 ). The superscript (k, b ) is understood as meaning that we 
have done k batches and used x k, °,..., x k,b ~ 1 ,y k+1, °, ...y k+l,b ~ l to update our belief. Similarly 
to equations m and (1131) . the new updating equations can be written recursively for a batch 
measurements as 


e 


k, 6+1 _ 


y 


fc+l ,3 _ Q k ’0 




^/c,6+1 _ ^ k,b _ 


e k,o + ^ 

j =0 

V k ’»e xk , b (e xk , b ) T V k ’ b 


a "k.-i + ^ 








cr 


k,j 




k.b 


(18) 


(19) 


J X k 'bx k )b 

where k = 0,1,..., K — 1, b = 0,1, ...,B — 1, 0 k+1, ° = d k,B , and S fe+1, ° = T, k,B . It is worth 
emphasizing that in the batch setting the covariance matrix would be updated within a batch 
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since it is determined by the measurement decisions and is independent of the observations, 
whereas the mean values 6 k,b are only updated after the observations are collected for the 
whole batch. 

The batch knowledge gradient policy greedily adds in each alternative that maximizes 
the expected incremental value one at a time until B alternatives are chosen. The expected 
incremental value of measuring alternatives x \...., Xj at time step k is defined as 

,BKG (gk) = E [ max 0 fc+ 1 _ maK 6 k \x k ’° = x u ..., x^- 1 = Xj , 5 fc ]. 


v 


XI ,...Xj ' 


The batch knowledge gradient policy has the decision function 

T k,b yBKG fs ^ = „ r(TTrlflv ,,BKG / ak\ 

•E • "^6 IllclX V k,Q j.k,b— 1 ) 5 


which can be rewritten as 


X ® KG (S k ) = argmaxE 

x£X 


b -1 


max ( 0 kfi + a('Z k ’i,x k ’ j )Z k+1 ’i + x)Z k+1 > i 

3=0 


, (20) 


according to equation (11811 and (1191) . Notice that here x k,J , j < b are fixed when choosing 
x k,b , and ca n be updated within a batch according to (| 19 1) . Since an analytic expression 

for the expected maximization as in (1201) is unknown, Monte Carlo sampling is used to 
approximate the expectation. The pseudo-code of the BKG for the fe-th batch decision and 
the Monte Carlo algorithm are presented in Algorithm [3] and Algorithm [4) 


Algorithm 3 Batch knowledge gradient policy with lookup table belief for the fc-th batch 
decision_ 

Require: 6 k ’°, and the number of sample Q for the Monte Carlo simulation 
1: Use the KGCB policy to find x k, ° 

2: <x° <- 5(£ fc ’°,X fc ’°) 

3: Update H 6,1 according to (fl9l) 

4: for b = 1 to B — 1 do 

5: Use Algorithm 0] below to calculate u® fc K 0 G k,b-i x k,b= E 

6: x k ’ b = arg rnax ie ^ v x*^...,x k > b — 1 ,x k ’ i =x 

7: a b «- 5(£ n ’ b ,x n ’ 6 ) 

8: Update E n,fe+1 according to (fT9l) 

9: end for 

10: return batch decisions x k, °, x kjl , ..., x k,B 1 


Besides, the logic of the batch knowledge gradient policy can be generalized to a linear 
belief model. In this case, instead of recursively updating 6 k,b and Yi k,b directly as in (1181) 
and (1191) . we recursively update and for a batch of observations through RLS and 

use the linear transformation 6 = <l>$ and X = : 

b j 

# k +i,b+i = + —£^ 0 * 9 , ( 21 ) 

Z- — J <-y 

3 =° ' 

^i9,fc+l,b+l _ 9,k,b _ ^ ry\'d,k,b f j i k,b/ ( j i k,b^T-y\'&,k,b^ (22) 

rs,k,b 
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Algorithm 4 Monte Carlo simulation for calculating KG values 
Require: 6, 0 k, °, <x°, er 1 , a b 1 ,S fe ’ 6 , and Q 
1: for all x € X do 
2: surn x = 0 

3: for q = 1 to Q do 

4: for j = 0 to b do 

5: Generate a realization z q of Z k, i 

6: end for 

7: temp mav (Q k f + Y%=o °i' z q + <x(2 k ’ b , x)z b q ) 

8: sum x sum x + temp 

9: end for 

10: end for 
11: return sum/Q 


where e fc + 1 4 = y k + l J _ an d _ a 2 _|_ 

Furthermore, for the sparse linear model, the posterior distribution of the sparsity struc¬ 
ture parameter £ will be updated according to m and (HU) after the observations are 
revealed for the whole batch. Since ~ A/’('d fc , '£’ 0,k ), then given £, and can 

be updated according to m and . The batch SpKG algorithm works by greedily adding 
in each alternative that maximizes the expected marginal value until B alternatives are cho¬ 
sen given the sparsity structure unchanged within a batch. The KG value of measuring 
alternatives x\..... Xj at time step k can be computed as 


, BSpKG 
'Bk, 0 




=x (s k ) = Y J ^Ap Kl y. 


BKG 

fc ,0 


k, b= JS s V’ k ’ 1 ), 


1=1 


(23) 


where S Sp,k ’ 1 is the knowledge state given C k ’\ and S Sp,k ’ 1 = (4>*^,z) T ). 
After x k ’°, ..., x k,b ^ 1 are chosen, u B ^ 0 G x k,b-i x k,b = (S Sp,k ’ 1 ) can be approximated using Algo- 
rithm[4l Then the 6-th decision within a batch is given by x k,b = arg max x v^ k ^ K x k, b -i x k, b — L (S k ). 
The batch SpKG algorithm can be summarized in Algorithm [SJ 


3.3 The Batch SpKG Algorithm with Length Mutagenesis 

For all of the above algorithms, we use a fixed set of discrete alternatives X. For this 
RNA accessibility identification problem, if we consider the probe sequence with length of 
8~16, the number of alternatives can be ~4000. Including all of them in the alternative 
library would be computationally expensive while working with only a small subset would 
possibly miss the most accessible region. To compromise, we propose a novel procedure 
that we call length mutagenesis , which sequentially enlarges the probe library through an 
adaptive probe refinement procedure. We simply refer to this as “mutagenesis” throughout 
the remainder of the paper for compactness. The mutagenesis works as follows. Suppose 
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Algorithm 5 Batch sparse knowledge gradient policy for the k -th batch decision 
Require: anc | the number of sample Q for the Monte Carlo simulation 

1: Use Algorithm [2] to find 
2: for l = 1 to L do 


3: 


10 ,1 


4: Update according to 

5: end for 

6: for b = 1 to B — 1 do 


r k, 0 


) 


7: 


9: 

10 : 

11 : 

12 : 

13: 

14: 

15: 


for l = 1 to L do 
Use Algorithm[J]to calculate n B fe K 0 G 


, ^ k ' b = and a 


1 9,k,b 

c 


k.b— 1 ~k,b 


_ (S Sp,k ’ 1 ) with input parameters = 


end for 

Calculate ja .fc,b =3 . according to 

k b ’ BSpKG 

X ’ = arg max xe x v x k,o ; ___ ;X k,b-i ;X k,b =x 

for l = 1 to L do 

* b ' 1 <r- v(*^ k ^J b (*^ k ,l) T ,X k ’ b ) 

Update S^. fc / 6+1 according to 

end for 


r0 ,1 


6-1,Z 


16: end for 

17: return batch decisions x fc ’°,: 


1 


^,5-1 


that at time n we have a library of probes that we denote as 

X n = {xi,.. .,%»}■ 


For the next experiment, we could consider a larger library of sub-probes obtained through 
mutagenesis. With mutagenesis, we either add or delete nucleotides at one end of a probe. 
For now, let us think of each alternative x as a probe sequence representing a region in the 
target molecule. Specifically, given a probe x, we can alter it through a round of mutagenesis 
to get a new probe x' of the form 


x = 


,j] x' = 


[i + k,j ], 
[i + k,j], 


i + k < j 
i < j + k 


0 < \k\ < 7. 


Since a probe of length less than 4 does not necessarily bind to the correct targeted region 
experimentally, we limit the probe length to be no less than 4. Then for a probe x, let A4(x) 
denote the set of possible probes obtained from x through mutagenesis. At time n, we get 
an expanded library through mutagenesis, that is 

M n 

X n = X n u\jM( Xi ). 

1=1 

From this expanded library, we pick the alternative with the highest KG score, that is 


x 


n 


= arg max v 
x&X n 


KG,n 

x 
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Then we add this probe to our library if it is new, 

x n+1 = ru { x n }. 

This approach allows us to add a new probe which potentially has a higher fluoresence signal 
at each time and work dynamically with the alternative library. The computation is simpler 
and more efficient than maintaining all the possible alternatives. As shown in the simulations 
in Figure El we have a much better chance of obtaining information about the accessibility 
of the molecule relative to maintaining a fixed probe region. 


4 Empirical Study 

In this section, we present the simulation results of the SpKG algorithms for the RNA 
accessibility identification problem described in Section [2j The target RNA molecule is the 
Group I intron, a mid-size RNA molecule sequence with a length of 414. Due to the nature 
of this problem, we are not able to identify the 21 nucleotides at one end of the molecule 
sequence. Therefore, we only work with a sequence with length of 393. 

In Section 14.11 we describe the prior in vitro DMS footprinting data and the methods 
for generating the prior covariance matrix. In Section [1.21 we present the simulation results 
of the performance of SpKG on a collection of probe sequences as well as those of the batch 
SpKG algorithm and the batch SpKG with mutagenesis scheme. We also compare these 
policies with several other policies. 


4.1 Prior Distribution 


When choosing a prior distribution, the domain experts can have many ways to articulate 
their prior belief on the accessible regions. In this problem, we have the accessibility profile 
obtained from the in vitro DMS footprinting. The DMS footprinting is a standard chemical 
method to study RNA structure. It relies on the reactivity of a small molecule Di-methyl 
sulfate (DMS) with the base-pairing molecular faces of adenosines and cytidines (A and C). 
The higher the DMS reactivity is for a nucleotide site, the more the nucleotide is exposed. By 
reversely transcribing the DMS reacted RNA into DNA, we can determine sites of reaction 
and thus the levels of pr otecti on ex posur e at a single-nucleotide resolution. Here we use in 
vitro DMS data from 


Russell et al. 


2006] as an initial estimation of nucleotide accessibility. 
One may think of this dataset as providing the priors and (£?, rf -) for j = 1 ,,p. 

We now discuss how we generate the prior covariance matrix S’ 9,0 . For some previous 
work, this matrix is generated by taking the diagonal matrix with the variance from the 
measurement noise. This means we begin with independent beliefs. However, for this prob¬ 
lem, the weight accessibility coefficients have natural proximity correlations. As can be seen 
from Figure [2] (a), the value of the accessibility coefficients are quite close locally. In fact, 
if we plot the sample autocorrelation function as shown in Figure [2](b), we can see that the 
correlation is 0.4718 when the lag is 1, jumps in the interval [-0.1, 0.2] for lag until 100 
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and almost decays to 0 after 250. If we fit an exponential function y = e KX to the sample 
autocorrelation, the best fitted decay rate by least squares is n* = 0.39728. 
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Figure 2: (a) Prior Accessibility Coefficients of the In vitro DMS footprinting Data; (b) The 
Sample Autocorrelation Function for Prior Accessibility Coefficients 

Note. The fitted exponential decay function with a decay rate of 0.39728 is plotted in the 
blue dash line in (b). 

This proximity correlation detection is also consistent with the domain experts’ experi¬ 
ence that the probes tend to perform similarly within a window of up to ~40. Theoretically, 
the larger this window is, the less measurements we need to identify the accessibility pat¬ 
terns of the target RNA. This is because the advantage of our algorithm is to incorporate 
the covariance matrix E , so we can locally infer more information based on what we have 
learned. Taking advantage of this proximity correlation knowledge, we use the exponential 
covariance function to model the prior covariance E 1 ^ 0 . Under the exponential covariance 
functions, for any two points i and j from 1 ,,p, 


Corr= exp{-y|i - j\}, 
Var('dj) = pf, 


(24) 

(25) 


where 7 > 0 and (3i,... ,/3 p > 0 are hyperparameters chosen to reflect our belief. Here, /3* 


should be chosen to represent our confidence that is close to our chosen mean function. 


The 7 should reflect how quickly we believe E f- changes as i and j move further apart, with 
larger values of 7 suggesting more rapid change. This simple family of covariance functions 
produces Gaussian process priors that are stationary and thus can be used for modeling the 
accessibility coefficients in this problem. 


In practice, when one is unsure about the value of these hyperparameters, second-level 
priors can be put to model the coefficients 7 and /?,. However, instead of using these hierar¬ 


chical maximum a posteriori (MAP) approaches, we directly set up the values according to 


19 
















the prior in vitro DMS footprinting data and the domain experts’ experience. Specifically, 
we let 7 = k* = 0.39728 from the fitted decay rate of the in vitro DMS footprinting data 
as shown in Figure [2](b). Also, according to our domain experts, the noise ratio for esti¬ 
mating the accessibility coefficient is 10% ~ 15%, so we set the noise ratio r = 20% to be 
conservative. That is 


/3j = 20 % x i for j = 1 ,... ,p, 


(26) 


where 


for j : i) 3 % 0 
Ej=i dj/P for j : = 0 , 


(27) 


Combining (1241) . (1251) . and (1261) . we set the prior covariance matrix S ’ 9,0 as 

E ff = r2 3i#j ex P{-«*l*-j|}, for i,j = (28) 

where r = 20%, n* = 0.39728, and $i,... are from (1271) . 

Besides the prior covariance matrix, we also have to set the shape parameters (£j,rij) for 
the beta distribution (the frequency priors) in 0. For j = 1... ,p, we propose to set the 
frequency priors as 

f (1,1) + (w, 0 ), for j : / 0 
{ (1,1) + (0,W), for j : = 0 


Here w > 0 is a hyperparameter representing our confidence in the prior sparsity pattern. A 
smaller w reflects less confidence in the prior while a larger w represents more confidence. If 
at the end of the experiments our algorithm uses probability 0.5 as a threshold to choose the 
nonzero coefficients, then w should not be larger than the sampling budget N. Otherwise, 
the sparsity pattern of the posterior estimate is totally identical with the prior data no 
matter what Lasso estimates we get. In the following simulations, we treat w as a tunable 
parameter depending on either the good prior or the bad prior cases. 


4.2 Simulation Results 

Notice that for this RNA accessibility identification problem, the real accessibility profile 
is unknown, while we can approximately learn this through various experimental methods. 
In this paper, we perform various simulations in which we sample a truth from a stochastic 
process and then run this trial with this fixed truth for some fixed number of measurement 
budget N. Then we replicate this over several runs to assess the performance of various 
policies. This truth coefficient is usually sampled through both vertically perturbing the 
values of the prior coefficient by a normally distributed random variable, and horizontally 
rotating the prior along the RNA molecule. In this section, we show the simulation results 
for all of the algorithms demonstrated in Section [31 
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4.2.1 Results using the SpKG Algorithm 

We begin by describing the results for the SpKG policy presented in Algorithm [2j In 
this simulation, as suggested by the scientists on the team, we try all the probes of length 10 
with 3 overlaps for the adjacent ones. Then we have M = 55 number of alternatives. In this 
setting, we compare Algorithm [2] with two other policies: a pure exploration policy, which 
randomly chooses an alternative to test at each time, and Algorithm [1] which uses KG with 
a nonsparse, linear belief model. It is worth emphasizing that for the pure exploration policy 
here, although it does not use the sparse linear structure to make measurement decisions, it 
still updates the belief in the same way as that of SpKG. 

In this simulation, we generate the true accessibility coefficient vector from a multivariate 
normal distribution with mean i?°, with the covariance matrix being the same form as (1281) . 
The differences are the noise ratio r is chosen to be as large as 10, and k is drawn from a 
normal distribution with mean k* and standard deviation 0.1. Then we take this vector and 
right circularly shift it by an integer value uniformly sampled from 20 to 50 at each time. 
Since the prior has now been significantly altered from the truth, we believe it is not a good 
prior and set w = 10 with the measurement budget N = 100. To quantitatively measure 
the performance of different policies, we consider the opportunity cost (OC), defined as the 
difference between the true value of the alternative that is actually the best and the true 
value of the alternative that is the best according to the policy’s posterior belief distribution, 
i.e., 


OC(n) = n(x*) — , 

where x* is the true optimal alternative, and x n ’* is the estimated optimal alternative at 
time n. So OC describes how far from optimal the current estimate of the optimal solution 
is after each experiment and and thus can serve as a metric for the performance of a specific 
decision policy. For illustrative purposes, we also consider the percentage OC with respect 
to the optimal value, 

fi(x*) — n(x n ’*) 

This normalized representation is unit-free and better illustrates how far in percentage we 
are from the optimal. 

Figure [3] plots the average percentage OC on a log scale over 100 runs for three dif¬ 
ferent policies in both low and high noise settings. The standard deviations of noise are 
10% and 50% of the expected range of the truth. As both figures show, compared with 
pure exploration, SpKG results in a significant reduction in the opportunity cost, although 
the exploration policy also takes advantage of the sparse linear structure for the Bayesian 
implementation. When comparing with KG for a nonsparse, linear belief model, SpKG out¬ 
performs with lower average opportunity costs most of the time. However, during the initial 
stage when there are less than five measurements, SpKG behaves no better than KGLin, 
especially in low noise settings. This is because it takes several samples for Lasso to iden- 
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Figure 3: Average Opportunity Cost Comparing Policies with Low and High Measurement 
Noise 


Notes. This simulation is for the whole target molecule sequence. The alternative probes 
are of length 10 with 3 overlaps for the adjacent ones. To better visualize the difference in 
OC, the average percentage OC is plotted on a log scale over 100 simulation trials. 


tify the true support. When it finds the sparsity pattern, SpKG is able to find alternatives 
converging to the truth much more efficiently than other policies. 

4.2.2 Results using the Batch SpKG Algorithm 

In this simulation, we try testing the batch SpKG algorithm described in Algorithm [5] 
For the real experiments, the experimentalist is able to synthesize three probes to test the 
fluorescence intensities in parallel at each time. So in the batch setting, we let B = 3. From 
now on, let us take a specific region of the RNA molecule from site 95 to site 251, as this region 
is suggested by the domain experts to be the most promising sub-sequence. We try a larger 
set of probes than before: all 8-long probes shifted by 4, all 12-long probes shifted by 6, and 
all 16-long probes shifted by 8. That is we take all the regions [4& + 1, 4fc + 8], [6& + 1, 6/c +12], 
and [8k + 1,8k + 16] starting from 95 and ending at 251. Beside these, we also include ten 
other probes as suggested by the domain experts: [98,112], [113,126], [127,140], [141,155], 
[156,170], [171,179], [179,194], [195, 214], [215, 233], and [234, 251]. In total, we have M = 91 
number of alternatives. We generate the true accessibility profile in the same way as before. 

First, we illustrate how the batch SpKG policy works under a measurement noise of 30%. 
For one such simulated truth, we depict the batch SpKG value initially, after one, and two 
batch measurements, respectively in Figure[4j For these figures, we only include those probes 
with batch SpKG values above the mean to better visualize the KG scores. As indicated by 
the arrows, for the probes with the largest batch SpKG scores, the KG scores drop after they 
have been measured. As we only plot those with KG scores above average, some probes with 
high KG scores in Figure 0|[a) have the scores dropped below average after being measured 
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and are therefore not shown in Figure H](b). This observation is also consistent with our 
intuition of SpKG as a measure of the value of information, and thus we can use this policy 
as a guideline to pick the next experiments. 


(a) No measurement 


(b) After 1 batch measurement 


(c) After 2 batch measurements 



Nucleotide position 


Nucleotide position 


Nucleotide position 


Figure 4: Batch SpKG Values Before and After 1 and 2 Batch Measurements with Noise 
Ratio of 30%. 

Notes. This simulation is for a selected set of probes ranging from site 95 to site 251. Each 
bar is a potential range of a probe. The vertical axis is the batch SpKG score. Only those 
probes with the batch SpKG score above the mean are plotted. The arrows indicate the 
decreases in KG values for the probes that were previously measured. Note that some of 
these are not shown since they have KG values below average. 


Furthermore, for one simulated truth, we also plot the estimates of accessibility profiles 
(coefficients) after 5, 10, 15, and 20 batch measurements with a noise ratio of 20% in Figure 

El 

As one can see from Figure El after five batch measurements, the estimate is still closer 
to the prior than the truth. After 10 batch measurements, we have discovered many of the 
accessible regions. After 15 batch measurements, we have not only discovered the location 
of the accessible regions, but also obtained good estimates for the actual accessibility value. 
And after 20 batch measurements, our estimate closely matches the truth. At last, we also 
try different noise levels 20%, 30%, 40%, and 50% and repeatedly run such simulations for 200 
times for each level. The averaged percentage OC and estimation error are plotted in Figure 
El Here the normalized estimation error is the £2 distance between the estimated coefficient 
and the truth divided by the length of the coefficient vector, which is 157 currently. 

Experimentally, fluorescent measurements are made by performing induction assays on 
prepared cell cultures. For each cell culture prepared, several samples are obtained, and 
fluorescence measurements are made via flow cytometry. Measurement dispersity in a small 
number of samples can be as large as 15% ~ 20% in standard deviation. For this noise level, 
we can see from the figure that most locations of the highly accessible regions can be found 
after about 25 observations. However, note that the true accessibility profiles sampled for 
the simulations are perturbed by a large amount. In reality, we suspect the truth to be 
more in agreement with the prior footprinting data, which implies better performance by 
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(a) Accessibility profile estimate after 5 batch measurement 



(b) Accessibility profile estimate after 10 batch measurement 



(c) Accessibility profile estimate after 15 batch measurement 



(d) Accessibility profile estimate after 20 batch measurement 



Figure 5: Accessibility Profile Estimate by the Batch SpKG Algorithm After 5, 10, 15, and 
20 Batch Measurements with Noise Ratio of 20% 


Note. This is for the accessibility coefficient ranging from site 95 to site 251. 


the SpKG algorithm. 

4.2.3 Results using the Batch SpKG Algorithm with Length Mutagenesis 

In this set of experiments, we use the same set of probes in Section 14.2.21 as the initial 
alternative library. The simulations are run with two different priors: a good prior and a bad 
prior. The bad prior is the one used in the above experiments, which is obtained by doing 
both vertical perturbation and horizontal shift from the in vitro DMS footprinting data. For 
the good prior, we only do vertical perturbation with the same amount, so we would think 
of the sparsity pattern of the good prior more proximal to the truth. Therefore, for this set 
of simulations with L = 20, B = 3, we set w = 10 for the bad prior and w = 20 for the good 
prior. 

Figure [7] compares the batch SpKG algorithm with and without mutagenesis for both 
good and bad priors averaged over 300 simulation trials. It shows the mean percentage 
opportunity cost as a function of measurements and measurement noise errors. We try ten 
different noise levels from 10% to 100%. For all of the figures, we observe that the OC% 
decays as the number of measurements increases and the measurement noise decreases. Such 
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(a) Opportunity cost of batch SpKG 


(b) Estimation error of batch SpKG 




Measurement 


Measurement 


Figure 6: (a) Averaged Percentage Opportunity Cost of Batch SpKG with Various Noise Ra¬ 
tios; (b) Averaged Normalized 1 2 Estimation Error of Accessibility Coefficient with Various 
Noise Ratios 


Note. Both figures are for replicated runs averaged over 200 times. 


plots can be useful in experimental budgeting and show the required number of measurements 
needed to obtain a certain level of optimality for some noise level. 

Comparing Figure [7(a) (b) with (c)(d), we can see that the OC% decreases much faster 
in the good prior cases as a function of the number of experiments, as expected. Com¬ 
paring Figure [7] (a)(c) with (b)(d), we find that the OC% generated from the mutagenesis 
procedures tends to be smaller than those without mutagenesis. This is because when we 
add a new probe with the largest KG score into the alternative library with mutagenesis, 
we often add the one with a higher fluorescence level than the current fluorescences. In 
other words, the true highest fluorescence level is increasing as a new probe is added into 
the library. In such cases, it would be more intuitive to see how the highest fluorescence 
is varying over time. Figure [8] provides a more illustrative explanation for how the actual 
highest fluorescence changes. For this set of figures, we compare how the actual fluorescence 
changes over measurement. The red star line is the true highest fluorescence, and the blue 
solid line is the value of true fluorescence by estimation. So the difference between the two 
lines is the OC. We compare three different policies: batch SpKG, batch SpKG with mu¬ 
tagenesis, and exploration mutagenesis. Exploration mutagenesis involves randomly adding 
new probes if they are not in the current library. From Figure El we can see that SpKG with 
mutagenesis has the ability to find new probes with fluorescence values about three to five 
times the highest values in the initial set. However, for exploration mutagenesis, the highest 
fluorescence improves less, as expected. This also proves the power of the KG policy to iden¬ 
tify the potential alternative that can outperform the current optimal one. Furthermore, it 
is also worth noting that with mutagenesis, the value of the true fluorescence through the 
Bayesian estimation is pretty close to the truth. That means the OC is close to zero, which 
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is consistent with Figure E^b) (d). 


(a) OC of batch SpKG for bad priors 


(b) OC of batch SpKG with mutagenesis for bad priors 



0.2 0.4 0.6 0.8 1 


Measurement noise error 



(c) OC of batch SpKG for good priors 


(d) OC of batch SpKG with mutagenesis for good priors 



Measurement noise error 


Measurement noise error 


Figure 7: Averaged Percentage Opportunity Cost of Batch SpKG with and without Muta¬ 
genesis for Good and Bad Priors over 300 Runs 

Note. The contour plots show averaged percentage OC as a function of measurements and 
measurement noise errors. 


5 Conclusion 


Identifying the accessibility pattern of an RNA molecule is an important topic in molec¬ 
ular biology. On one hand, the real experimental study is a long and expensive process 
for which adaptive learning procedures like the knowledge gradient policy are well suited to 
make experimental decisions. On the other hand, this problem naturally incorporates spar¬ 
sity linear structure and thus requires more sophisticated statistical techniques to analyze the 
underlying model. To better helpjearn the acce ssibility profile of the RNA molecule, we use 
a recently derived SpKG policy Li et ah, 120151 ], which is a novel hybrid of Bayesian Rank¬ 


ing <fc Selection and frequentist penalized regression approach called Lasso. This optimal 
learning algorithm has been shown to efficiently identify the accessibility pattern and learn 
the underlying sparsity structures. Controlled experiments also show that it outperforms 
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Measurement 


Measurement 


Measurement 



Figure 8: True Highest Log Fluorescence and Estimated Highest Fluorescence for Good and 
Bad Priors Comparing Three Policies: Batch SpKG, Batch SpKG with Mutagenesis, and 
Exploration Mutagenesis 

Notes. These plots show the values in true fluorescence the highest and the estimated 
highest. The difference between the two lines is the OC. The highest fluorescence for batch 
SpKG policy remains constant since we maintain the same probe alternative library 
throughout measurements. With mutagenesis, most of the time we could find the probe 
with higher fluorescence than before. For batch SpKG with mutagenesis, the new probe 
with the largest KG score is added. For exploration mutagenesis, the new probe is 
randomly added. The results are averaged over 300 runs. 


several other policies. 

Algorithmically, we also entend the SpKG policy into a general framework for batch 
mode learning, where the experimenter can run several experiments in batch. Empirical 
studies demonstrate the effectiveness of this policy for various experimental setups. Besides, 
we also derive the batch SpKG algorithm using length mutagenesis to expand the set of 
alternatives. In this procedure, the alternative library is adaptively enlarged as the most 
promising alternative is added in at each time. Controlled experiments also demonstrate its 
efficiency in identifying the accessibility pattern of the RNA molecule. 

In conclusion, it is worth noting that the SpKG algorithm has only been applied to sparse 
linear beliefs. Possible future directions of the work would include the study of more general 
nonlinear beliefs that incorporate sparsity structure. Despite this limitation, we still believe 
the SpKG algorithm would allow efficient implementation for large data sets, and we would 
like to suggest this algorithm for solving more general application problems with sparse linear 
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beliefs. 
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